Kernel principal component analysis (PCA) control chart for monitoring mixed non-linear variable and attribute quality characteristics

Products are commonly measured by two types of quality characteristics: variable characteristics, which are measured on a numerical scale, and attribute characteristics, which are categorical. In process monitoring, the multivariate variable quality characteristics may also have a nonlinear relationship. In this paper, the Kernel PCA control chart is applied to monitor mixed (attribute and variable) characteristics with a nonlinear relationship. First, the Average Run Length (ARL) is used to evaluate the performance of the proposed chart. The simulation studies show that the proposed chart can detect a shift in the process, and the Radial Basis Function (RBF) kernel performs consistently across the cases studied. Second, the proposed chart is compared with the conventional PCA Mix chart. The results show that the proposed chart performs better in detecting a small shift in the process. Finally, the proposed chart is applied to the well-known NSL KDD dataset, where it shows good accuracy in detecting network intrusion. However, it still produces a large number of False Negatives (FN).


Introduction
Two types of control charts have been developed according to the quality characteristics being monitored: attribute charts and variable charts. The variable control chart monitors variable quality characteristics (on an interval or ratio scale) such as length, temperature, or height (Montgomery, 2009). The attribute chart monitors attribute quality characteristics (on a categorical scale) (Ahsan et al., 2018). When the quality characteristics are correlated or cannot be monitored separately, a multivariate control chart is used. There are three main types of multivariate variable control charts, namely Shewhart, multivariate exponentially weighted moving average (MEWMA), and multivariate cumulative sum (MCUSUM).
Product quality characteristics are not only gauged individually as attribute or variable characteristics; they can also be monitored using a mixed scheme. To support such mixed monitoring, several works have studied the development of mixed characteristics charts. A mixed scheme that combines two existing charts has been proposed and performs well in monitoring mixed characteristics (Aslam et al., 2015).

Table 1. Recent developments in multivariate variable control charts
Sources | Proposed scheme | Findings
Chiang et al. (2021) | New scheme of multivariate auxiliary-information-based (AIB) chart | The performance of the proposed chart is evaluated using Monte-Carlo simulation and applied to cement data
Ahmad and Ahmed (2021) | $T^2$ control chart to inspect the high dimensional data | The proposed method is usable without preprocessing or dimension reduction with high accuracy detection
Haddad (2021) | $T^2$ control charts using modified Mahalanobis distance | The proposed method has better performance in detecting more outliers compared to the traditional chart
Cabana and Lillo (2021) | Robust multivariate chart for individual observations using reweighted shrinkage estimators | The proposed chart has a better performance for high dimensional and highly contaminated data
Maleki et al. (2020) | Median estimators of the $T^2$ control chart | The proposed method outperforms the conventional chart
Haddad et al. (2019) | Bivariate Hotelling's $T^2$ charts with bootstrap data | The proposed method shows a better performance compared to the conventional method
Tiengket et al. (2020) | Bivariate copulas on the Hotelling's $T^2$ control chart | The bivariate copulas method can be used in the Hotelling's $T^2$ chart
Mashuri et al. (2019) | Tr($R^2$) control charts with Kernel Density Estimation (KDE) control limit | The proposed control chart method presents better performance to detect the shift for the large characteristics and sample size
Mehmood et al. (2019) | Hotelling $T^2$ control chart based on bivariate ranked set schemes | Proposed control chart schemes demonstrate an outstanding performance compared to the classical Hotelling $T^2$
Haq and Khoo (2019) | Adaptive MEWMA chart | The proposed chart surpasses the performances of the existing adaptive multivariate charts
Flury and Quaglino (2018) | MEWMA chart for asymmetric gamma distributions | The proposed MEWMA chart outperforms the conventional $T^2$ chart in all the cases
Haq et al. (2020) | Dual MCUSUM charts with auxiliary information for the process mean | The proposed chart has a better performance compared to the DMCUSUM and MDMCUSUM charts when detecting different sizes of a shift in the process mean vector. The proposed chart outperforms the other charts for most shift domains

Table 2. Recent developments in multivariate attribute control charts
Sources | Proposed scheme | Findings
Mashuri et al. (2020) | Fuzzy bivariate chart | The proposed chart is more sensitive than the conventional bivariate Poisson chart
Zhou et al. (2020) | Synthetic control chart for attribute inspection | The proposed chart demonstrates a higher detection performance for small and large mean shifts
Quinino et al. (2020) | Attribute chart for the joint monitoring of mean and variance | The proposed method is easier to implement compared to the conventional approach
Aldosari et al. (2019) | Attribute control chart for multivariate Poisson distribution using multiple dependent state repetitive sampling (MDSRS) | The proposed method has a better performance than the conventional one based on repetitive sampling
Aslam et al. (2019) | Shewhart attribute control with the neutrosophic statistical interval | The proposed attribute control chart has a good ability to detect a shift in the process
Chong et al. (2019) | Multi-attribute CUSUM-np chart | The proposed procedure has a better or equal performance compared to the conventional chart
Aslam (2019) | Attribute control chart using the repetitive sampling under the fuzzy neutrosophic system | The proposed chart with repetitive sampling under the fuzzy neutrosophic system is more sensitive in detecting a shift in the process as compared with the existing chart
Lee et al. (2017) | Multinomial generalized likelihood ratio (MGLR) chart | The proposed chart has better performance than the set of 2-sided Bernoulli CUSUM charts

... component analysis (FKICA-PCA) to monitor multivariate industrial processes.
The nonparametric Revised Spatial Rank Exponential Weighted Moving Average (RSREWMA) control chart was developed to assess multivariate nonlinear profile data (Pan et al., 2019). Kernel PCA can be applied to monitor such cases through the control chart approach. Based on the previous study, the KPCA Mix chart (Ahsan et al., 2020) can be extended to monitor multivariate nonlinear data. Therefore, this research proposes a mixed multivariate control chart based on the KPCA algorithm that accommodates mixed quality characteristics with a nonlinear relationship. The PCs Mix estimated by KPCA are transformed into Hotelling's $T^2$ statistics. The control limit of the $T^2$ statistics is calculated using kernel density estimation (KDE), the same method used in Ahsan et al. (2020). Moreover, to show the benefits and drawbacks of the proposed chart, its performance is compared with the conventional PCA Mix chart. The rest of this article is arranged as follows. Section 2 presents related studies. Section 3 describes the Kernel PCA method. Section 4 gives the charting procedures of the proposed KPCA Mix control chart. Section 5 assesses the performance of the proposed chart in detecting a shift in the process and compares it with the PCA Mix chart. Section 6 applies the proposed chart to simulated and real data. Section 7 presents conclusions and possible future research.

Related research
This section presents recent studies of control charts in three main categories: multivariate variable charts, attribute charts, and mixed charts. Table 1 lists the recent developments in multivariate variable charts, Table 2 those in multivariate attribute charts, and Table 3 those in mixed characteristics charts.
The recent development of mixed control charts shows that only a few works have studied the joint monitoring of variable and attribute characteristics. More development in this area is therefore needed, especially for nonlinear data. This work proposes a mixed control chart based on the Kernel PCA Mix algorithm. The control limit of the $T^2$ statistics from the PCs Mix is estimated using the KDE method, which performs better for non-normal data. The proposed chart is expected to perform better in monitoring nonlinear mixed data. To show this, its performance is compared with the conventional PCA Mix chart, and it is also applied to real data.

Table 3. Recent developments in mixed characteristics control charts
Sources | Proposed scheme | Findings
Ahsan et al. (2020) | Kernel PCA Mix chart | The proposed chart has a better performance compared to the PCA Mix chart
Ahsan et al. (2019) | PCA Mix chart for detecting outlier in mixed characteristics scheme | The proposed chart has a great performance to detect more outliers with a higher percentage of outliers added compared to the conventional and other robust charts
Ahsan et al. (2018) | PCA Mix control chart | The proposed chart presents good performance for an appropriate number of principal components used
Wang et al. (2018) | Multivariate sign chart | Simulations show the superiority of the proposed control chart in monitoring mixed-type data
Aslam et al. (2015) | The mixed chart to monitor the process | The mixed chart shows excellent performance in the monitoring process

Kernel PCA
PCA is a basis transformation that diagonalizes the estimated covariance matrix of the input data. PCA was originally proposed for linear data, so it is not powerful for nonlinear data. To overcome this nonlinearity problem, Schölkopf et al. (1997) proposed the Kernel PCA scheme.
The basic idea of Kernel PCA is to calculate the principal component scores in a higher dimensional feature space $F$ through a nonlinear mapping $\Phi: \mathbb{R}^{p} \rightarrow F$, $x \mapsto \Phi(x)$, as displayed in Fig. 1. This mapping can be carried out implicitly by using the kernel functions known from the Support Vector Machine (SVM) (Boser et al., 1992).
Assume that the centered data are mapped to the feature space $F$ as $\Phi(x_1), \ldots, \Phi(x_n)$. The feature space covariance matrix $C$ can be written as in Equation (3.1).
$$C = \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j) \Phi(x_j)^{T}. \quad (3.1)$$
The next step is to estimate the eigenvalues $\lambda \geq 0$ and eigenvectors $v$ that satisfy Equation (3.2).
$$\lambda v = C v. \quad (3.2)$$
In general, the mapping $\Phi(\cdot)$ cannot always be calculated explicitly. To solve this problem, only the dot products between mapped vectors in the feature space are computed. Let $K$ be the $n \times n$ matrix defined as
$$K_{ij} = \Phi(x_i)^{T} \Phi(x_j) = k(x_i, x_j).$$
With $K$, the eigenvalue problem and the principal component calculation can be solved without carrying out the nonlinear mapping; the kernel function $k(\cdot, \cdot)$ takes its place.
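The kernel trick described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rbf_kernel_matrix` and `kernel_pca`, the RBF parametrization with `sigma`, and the centering of the Gram matrix follow the standard Kernel PCA formulation and are assumptions made here for illustration.

```python
import numpy as np


def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    The parametrization with sigma is an assumption; the source does not
    show its kernel formulas.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def kernel_pca(K, n_components):
    """Return (scores, eigenvalues) of kernel PCA for a Gram matrix K."""
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    # Centre the data in feature space without ever computing Phi(x).
    Kc = K - ones @ K - K @ ones + ones @ K @ ones
    eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Rescale so that each feature-space eigenvector has unit norm.
    alphas = eigvecs / np.sqrt(eigvals)
    scores = Kc @ alphas  # projections of the training points
    # Eigenvalues of the feature-space covariance matrix are eigvals / n.
    return scores, eigvals / n
```

With this scaling, the variance of each score column equals the corresponding returned eigenvalue, which is convenient when the scores are later standardized into a $T^2$ statistic.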

Kernel PCA Mix control chart
The main concept of the Kernel PCA Mix chart is to form the PCs Mix as a representation of the mixed variables. The KPCA Mix chart procedure has two main steps. First, the $T^2$ statistics are computed. Second, the control limit is calculated by applying the KDE. These procedures are illustrated by the flowchart in Fig. 1 and detailed as follows:

Statistics calculation
1. Create the matrix $Z = [Z_1, Z_2]$, with one row per observation and one column per numeric characteristic or attribute category, where:
   a. $Z_1$ is the centered version of the matrix $X_1$, which contains the variable characteristics (numeric data).
   b. $Z_2$ is the centered version of the matrix that contains the dummy variables for each category of the attribute characteristics (categorical data) $X_2$.
2. Define $N = \frac{1}{n} I_n$, where $I_n$ is the identity matrix of size $n \times n$.
3. Define the diagonal weight matrix $M$, where the first columns (numeric data) are weighted by 1 and the last columns (categories) are weighted by $n / n_s$, for $s = 1, \ldots, m$, with $n_s$ the number of observations in category $s$.
...
6. Calculate the principal component scores (PCs) using the formula shown in Equation (4.4).
7. From the retained principal components, calculate the $T^2$ statistics using Equation (4.5).
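The first and last steps of this procedure can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's code: the helper names `one_hot`, `build_mixed_matrix`, and `t2_statistics` are hypothetical, the weighting and kernel steps in between are omitted, and the $T^2$ form used here (squared scores divided by their eigenvalues) is the usual Hotelling statistic on principal components, assumed because Equation (4.5) is not reproduced in the text.

```python
import numpy as np


def one_hot(column):
    """Dummy-code one categorical column (one indicator column per category)."""
    column = np.asarray(column)
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float)


def build_mixed_matrix(x_num, x_cat):
    """Step 1: stack centred numeric data and centred dummy variables."""
    x_num = np.asarray(x_num, dtype=float)
    x_cat = np.asarray(x_cat)
    z1 = x_num - x_num.mean(axis=0)
    g = np.hstack([one_hot(x_cat[:, j]) for j in range(x_cat.shape[1])])
    z2 = g - g.mean(axis=0)
    return np.hstack([z1, z2])


def t2_statistics(scores, eigenvalues):
    """Step 7 (assumed form): T2_i = sum_k scores[i, k]^2 / eigenvalues[k]."""
    scores = np.asarray(scores, dtype=float)
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return np.sum(scores ** 2 / eigenvalues, axis=1)
```

For example, `t2_statistics([[1, 2], [0, 2]], [1, 4])` gives one statistic per observation, each being the sum of the standardized squared scores.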

Control limit calculation
The control limit is estimated using the KDE approach because it can follow the unknown distribution of the input data. The control limit is calculated as follows:
1. Estimate the empirical density of the $\tilde{T}^2$ statistics using Equation (4.6).
2. ... where $\tilde{T}^2_{\min}$ and $\tilde{T}^2_{\max}$ are the minimum and maximum values of $\tilde{T}^2$.
3. Calculate the control limit using the expression shown in Equation (4.8).
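A KDE-based control limit of this kind can be sketched as the $(1 - \alpha)$ quantile of a kernel density estimate fitted to the in-control statistics. The sketch below is an illustration under stated assumptions, not the paper's Equations (4.6) to (4.8): it assumes a Gaussian smoothing kernel, Silverman's rule of thumb for the bandwidth `h`, and a bisection search for the quantile; `kde_cdf` and `kde_control_limit` are hypothetical names.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def kde_cdf(x, sample, h):
    """CDF at scalar x of a Gaussian-kernel density estimate with bandwidth h."""
    z = (x - np.asarray(sample, dtype=float)) / h
    return float(np.mean(0.5 * (1.0 + _erf(z / math.sqrt(2.0)))))


def kde_control_limit(t2, alpha=0.0027, h=None, tol=1e-8):
    """Smallest value whose estimated CDF reaches 1 - alpha (by bisection)."""
    t2 = np.asarray(t2, dtype=float)
    if h is None:
        # Silverman's rule of thumb: an assumption, the source does not
        # state how the bandwidth is chosen.
        h = 1.06 * np.std(t2, ddof=1) * t2.size ** (-1.0 / 5.0)
    lo, hi = t2.min() - 10.0 * h, t2.max() + 10.0 * h
    target = 1.0 - alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, t2, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

An observation is then flagged as out of control when its statistic exceeds the returned limit. The default `alpha = 0.0027` is the conventional false alarm rate that corresponds to an in-control ARL of about 370.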

Simulation set-up
The performance of the proposed control chart is assessed for the variable characteristics (numeric data) which have a nonlinear relationship. The nonlinear data is generated using the following procedures: 1. Generate vector 0 ∼ (0, 1) and 0 ∼ (0, 01, 1).

Define five nonlinear variable characteristics as:
The visualizations of those five generated characteristics are presented in Fig. 2.

Performance evaluation
Five variable quality characteristics $X_1$ (generated from the Multivariate Normal distribution) are involved, and 2, 3, and 4 principal components are evaluated. The performance is evaluated for three cases of the attribute characteristics $X_2$ (generated from the Multinomial distribution), with balanced, imbalanced, and extreme imbalanced proportions, defined as follows:
a. Balanced case with parameters $p_1 = p_2 = 0.3$ and $p_3 = 0.4$
b. Imbalanced case with parameters $p_1 = p_2 = 0.1$ and $p_3 = 0.8$
c. Extreme imbalanced case with parameters $p_1 = p_2 = 0.05$ and $p_3 = 0.9$
Furthermore, three kernel functions are used in this research: the linear, polynomial, and RBF kernels.
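The three attribute cases and the three kernel families can be written down compactly in Python. The multinomial proportions come from the text; the kernel parametrizations (`degree`, `coef0`, `sigma`) are common textbook forms and are assumptions here, since the paper's exact kernel formulas are not reproduced. The names `CASES` and `draw_attribute` are hypothetical.

```python
import numpy as np


def linear_kernel(x, y):
    """k(x, y) = x . y"""
    return float(np.dot(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """k(x, y) = (x . y + coef0) ** degree (assumed parametrization)."""
    return float((np.dot(x, y) + coef0) ** degree)


def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) (assumed parametrization)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


# Multinomial proportions of the three-category attribute, as given in the text.
CASES = {
    "balanced": (0.3, 0.3, 0.4),
    "imbalanced": (0.1, 0.1, 0.8),
    "extreme imbalanced": (0.05, 0.05, 0.9),
}


def draw_attribute(n, probs, rng):
    """Draw n observations of the attribute as dummy (one-hot) rows."""
    return rng.multinomial(1, probs, size=n)
```

For instance, `draw_attribute(1000, CASES["extreme imbalanced"], np.random.default_rng(1))` returns a 1000 by 3 indicator matrix in which roughly 90% of the rows fall in the third category.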

Extreme imbalanced case
The performance of the Kernel PCA Mix chart in handling nonlinear data with an extreme imbalanced proportion of attribute characteristics is tabulated in Tables 4-6. For the small number of principal ...

Imbalanced case
Tables 7-9 show the performance of the Kernel PCA Mix chart in inspecting nonlinear data with an imbalanced proportion of attribute characteristics. Similar to the previous results, the control limit produces a stable $\mathrm{ARL}_0$ of about 370. For every number of principal component scores used, the RBF kernel performs better than the other functions. The linear kernel displays poorer results in this case.
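The ARL values reported in these tables can be understood through a short sketch. The run length of one monitored stream is the number of observations until the first out-of-control signal, and the ARL is the average over many simulated streams. For independent observations with signal probability $p$, the ARL equals $1/p$, so a false alarm rate of 0.0027 gives an in-control ARL of about 370. The function names below are hypothetical, and the simulation design of the paper (number of replications, stream length) is not assumed.

```python
import numpy as np


def run_length(statistics, control_limit):
    """1-based index of the first statistic above the limit, or None if no signal."""
    above = np.flatnonzero(np.asarray(statistics, dtype=float) > control_limit)
    return int(above[0]) + 1 if above.size else None


def average_run_length(streams, control_limit):
    """Mean run length over the streams that signalled.

    Streams without any signal are dropped, which biases the estimate
    downwards if the simulated streams are too short.
    """
    lengths = [run_length(s, control_limit) for s in streams]
    lengths = [r for r in lengths if r is not None]
    return float(np.mean(lengths)) if lengths else float("inf")


def theoretical_arl(signal_probability):
    """ARL of a chart with independent points: 1 / P(signal)."""
    return 1.0 / signal_probability
```

A smaller out-of-control ARL for a given shift means that the chart detects that shift faster, which is the basis of the kernel comparisons in the tables.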

Balanced case
The performance of the Kernel PCA Mix chart in assessing nonlinear data with a balanced proportion of attribute characteristics is displayed in Tables 10-12. Similar to the previous results, the control limit produces a consistent $\mathrm{ARL}_0$ of about 370. The RBF kernel performs better than the others for every number of principal component scores used. The RBF kernel also reaches its peak performance when the proportion of attribute characteristics is balanced. In this case, the Polynomial and Linear kernel functions perform similarly.

Comparison with PCA Mix chart
The performance of the Kernel PCA Mix chart is compared with that of the PCA Mix chart in inspecting nonlinear data. The performance comparisons for the extreme imbalanced, imbalanced, and balanced cases are tabulated in Tables 13-15, respectively, and visualized in Figs. 3-5.

Discussion
This subsection discusses the performance of the proposed chart. First, the best kernel is the RBF kernel. This is because the other kernels are built on the linear kernel, whereas the process is generated to follow a nonlinear relationship. The RBF kernel is known to perform better in inspecting nonlinear processes under general smoothness assumptions (Zhicheng et al., 2012). It therefore makes sense that the RBF kernel performs better in this study. Table 16 summarizes the performance comparison between the Kernel PCA Mix chart and the PCA Mix chart. In general, both charts perform well in detecting a process shift. However, the Kernel PCA Mix chart performs better for a small process shift, while the PCA Mix chart performs better for a large process shift. This result indicates that the proposed method is better suited to nonlinear data with a small shift. The reason is that the PCA Mix chart was developed only for linear processes, whereas the proposed Kernel PCA Mix chart was developed to handle the nonlinearity problem.

Application to the real data
In this section, the Kernel PCA Mix chart is applied to monitor intrusion in a real dataset, the well-known NSL KDD dataset. This research analyzes only 20% of the NSL KDD dataset, which can be found at https://www.unb.ca/cic/datasets/nsl.html. The summary of this dataset is displayed in Table 17. Fig. 6 shows that the normal connections of the NSL KDD dataset are not normally distributed. The RBF kernel is used in this analysis because of its consistent performance in the simulation studies. Table 18 shows the accuracy of the Kernel PCA Mix chart in detecting intrusion in the NSL KDD dataset for several numbers of principal component scores. The results show that the optimal number of principal components is 4. With this number fixed, the analysis continues by searching for the optimal value of the kernel parameter. Based on the result in Table 19, the optimal value is 0.001. With these settings, the proposed method has a detection accuracy of about 0.85769. The misdetections are mainly due to the large FN rate, which indicates that many attacks are not detected as attacks.
The performance comparison with other methods is shown in Table 20. The proposed method is compared with several machine learning algorithms (Decision Tree, Naïve Bayes, Logistic Regression, and Support Vector Machine) and control chart methods (Hotelling's $T^2$ and PCA Mix chart). According to the table, the proposed method has higher accuracy than the other machine learning and control chart methods for the same number of quality characteristics monitored. The proposed method also yields a lower FP rate, which indicates that it produces fewer false alarms.
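The accuracy, FP rate, and FN rate used in this comparison can be computed from the four confusion matrix counts. The sketch below uses the standard definitions and treats an attack as the positive class; both choices are assumptions, since the paper's own formulas are not reproduced here, and the function name `detection_rates` is hypothetical.

```python
def detection_rates(tp, tn, fp, fn):
    """Accuracy, FP rate and FN rate from confusion matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal connections passed       fp: normal connections flagged
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "fp_rate": fp / (fp + tn),  # share of normal connections that raise an alarm
        "fn_rate": fn / (fn + tp),  # share of attacks that go undetected
    }
```

With purely illustrative counts, `detection_rates(30, 50, 10, 10)` gives an accuracy of 0.8, an FP rate of about 0.167, and an FN rate of 0.25, which shows how a chart can combine a low false alarm rate with a substantial share of missed attacks.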

Conclusion and future research
Table 20 (fragment): ... (Farid et al., 2014) 0.8192 0.1740; Hybrid Naïve Bayes (Farid et al., 2014) 0.8239 0.1640; Logistic Regression (Belavagi and Muniyal, 2016) 0 ...

In this research, a control chart that can monitor mixed variable and attribute characteristics with nonlinear relationships is proposed. The performance of the proposed chart is evaluated for several types of attribute characteristics and several kernel functions. The simulation studies show that the Kernel PCA Mix chart can detect a shift in the process. They also show that the RBF kernel is the better kernel function because of its consistency in detecting a shift in the process. The comparison with the PCA Mix chart shows that the proposed chart performs better for a small shift in the process, while the PCA Mix chart performs better for a large shift. This method can be applied to monitor processes with a nonlinear relationship, such as those in manufacturing and industry, chemical processes, biological processes, and network anomaly detection. The proposed chart is also applied to a real dataset, with the well-known NSL KDD dataset used as the benchmark. The monitoring results show that the proposed chart has a good detection accuracy of about 0.85769. Compared to the other methods, the proposed chart demonstrates better performance by producing higher accuracy and fewer false alarms. For future research, the Generative Principal Component Analysis (K. Liu et al., 2020, 2021) can be used to improve the performance of the proposed method. Also, the Bayesian-based PCA method (Y. Liu et al., 2018) can be applied for imbalanced cases.

Author contribution statement
Muhammad Ahsan: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Wrote the paper.
Muhammad Mashuri: Conceived and designed the experiments; Wrote the paper.
Hidayatul Khusna: Analyzed and interpreted the data; Wrote the paper.
Wibawati: Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data.

Data availability statement
Data associated with this study is available at https://www.unb.ca/cic/datasets/nsl.html.

Declaration of interests statement
The authors declare no conflict of interest.